{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "83fa4ca4",
   "metadata": {},
   "source": [
    "# Preface\n",
    "\n",
    "This book is an introduction to the practical tools of exploratory data\n",
    "analysis. The organization of the book follows the process I use when I\n",
    "start working with a dataset:\n",
    "\n",
    "-   Importing and cleaning: Whatever format the data is in, it usually\n",
    "    takes some time and effort to read the data, clean and transform it,\n",
    "    and check that everything made it through the translation process\n",
    "    intact.\n",
    "\n",
    "-   Single variable explorations: I usually start by examining one\n",
    "    variable at a time, finding out what the variables mean, looking at\n",
    "    distributions of the values, and choosing appropriate summary\n",
    "    statistics.\n",
    "\n",
    "-   Pair-wise explorations: To identify possible relationships between\n",
    "    variables, I look at tables and scatter plots, and compute\n",
    "    correlations and linear fits.\n",
    "\n",
    "-   Multivariate analysis: If there are apparent relationships between\n",
    "    variables, I use multiple regression to add control variables and\n",
    "    investigate more complex relationships.\n",
    "\n",
    "-   Estimation and hypothesis testing: When reporting statistical\n",
    "    results, it is important to answer three questions: How big is the\n",
    "    effect? How much variability should we expect if we run the same\n",
    "    measurement again? Is it possible that the apparent effect is due to\n",
    "    chance?\n",
    "\n",
    "-   Visualization: During exploration, visualization is an important\n",
    "    tool for finding possible relationships and effects. Then if an\n",
    "    apparent effect holds up to scrutiny, visualization is an effective\n",
    "    way to communicate results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e73a22e0",
   "metadata": {},
   "source": [
    "This book takes a computational approach, which has several advantages\n",
    "over mathematical approaches:\n",
    "\n",
    "-   I present most ideas using Python code, rather than mathematical\n",
    "    notation. In general, Python code is more readable; also, because it\n",
    "    is executable, readers can download it, run it, and modify it.\n",
    "\n",
    "-   Each chapter includes exercises readers can do to develop and\n",
    "    solidify their learning. When you write programs, you express your\n",
    "    understanding in code; while you are debugging the program, you are\n",
    "    also correcting your understanding.\n",
    "\n",
    "-   Some exercises involve experiments to test statistical behavior. For\n",
    "    example, you can explore the Central Limit Theorem (CLT) by\n",
    "    generating random samples and computing their sums. The resulting\n",
    "    visualizations demonstrate why the CLT works and when it doesn't.\n",
    "\n",
    "-   Some ideas that are hard to grasp mathematically are easy to\n",
    "    understand by simulation. For example, we approximate p-values by\n",
    "    running random simulations, which reinforces the meaning of the\n",
    "    p-value.\n",
    "\n",
    "-   Because the book is based on a general-purpose programming language\n",
    "    (Python), readers can import data from almost any source. They are\n",
    "    not limited to datasets that have been cleaned and formatted for a\n",
    "    particular statistics tool."
   ]
  },
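  {
   "cell_type": "markdown",
   "id": "5b1e9c42",
   "metadata": {},
   "source": [
    "As a rough illustration of the two simulation ideas in the list above,\n",
    "here is a minimal sketch that uses only the Python standard library. The\n",
    "function names `sample_sums` and `permutation_p_value`, the choice of an\n",
    "exponential distribution, and the two small groups of numbers are\n",
    "assumptions made up for this example; they are not taken from the\n",
    "book's code or data.\n",
    "\n",
    "```python\n",
    "import random\n",
    "import statistics\n",
    "\n",
    "\n",
    "def sample_sums(n, iters=1000, seed=17):\n",
    "    # Return `iters` sums, each the total of `n` exponential draws (rate 1).\n",
    "    rng = random.Random(seed)\n",
    "    return [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(iters)]\n",
    "\n",
    "\n",
    "def permutation_p_value(group1, group2, iters=1000, seed=17):\n",
    "    # Estimate a two-sided p-value for the difference in means by\n",
    "    # shuffling the pooled values and re-splitting them at random.\n",
    "    rng = random.Random(seed)\n",
    "    observed = abs(statistics.mean(group1) - statistics.mean(group2))\n",
    "    pooled = list(group1) + list(group2)\n",
    "    n = len(group1)\n",
    "    count = 0\n",
    "    for _ in range(iters):\n",
    "        rng.shuffle(pooled)\n",
    "        diff = abs(statistics.mean(pooled[:n]) - statistics.mean(pooled[n:]))\n",
    "        if diff >= observed:\n",
    "            count += 1\n",
    "    return count / iters\n",
    "\n",
    "\n",
    "# Sums of 100 draws: theory gives mean 100 and standard deviation 10.\n",
    "sums = sample_sums(n=100)\n",
    "print(statistics.mean(sums), statistics.stdev(sums))\n",
    "\n",
    "# Two made-up groups, used only to exercise the function.\n",
    "group_a = [1.1, 2.3, 2.9, 3.8, 4.0]\n",
    "group_b = [5.2, 6.1, 6.8, 7.4, 8.0]\n",
    "print(permutation_p_value(group_a, group_b))\n",
    "```\n",
    "\n",
    "With `n=100`, the sums should have a mean near 100 and a standard\n",
    "deviation near 10, and a histogram of them should look roughly\n",
    "bell-shaped. Because the two made-up groups do not overlap, few shuffles\n",
    "produce a difference as large as the observed one, so the estimated\n",
    "p-value is small."
   ]
  },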
  {
   "cell_type": "markdown",
   "id": "4dd02518",
   "metadata": {},
   "source": [
    "The book lends itself to a project-based approach. In my class, students\n",
    "work on a semester-long project that requires them to pose a statistical\n",
    "question, find a dataset that can address it, and apply each of the\n",
    "techniques they learn to their own data.\n",
    "\n",
    "To demonstrate my approach to statistical analysis, the book presents a\n",
    "case study that runs through all of the chapters. It uses data from two\n",
    "sources:\n",
    "\n",
    "-   The National Survey of Family Growth (NSFG), conducted by the U.S.\n",
    "    Centers for Disease Control and Prevention (CDC) to gather\n",
    "    \"information on family life, marriage and divorce, pregnancy,\n",
    "    infertility, use of contraception, and men's and women's health.\"\n",
    "    (See <http://cdc.gov/nchs/nsfg.htm>.)\n",
    "\n",
    "-   The Behavioral Risk Factor Surveillance System (BRFSS), conducted by\n",
    "    the National Center for Chronic Disease Prevention and Health\n",
    "    Promotion to \"track health conditions and risk behaviors in the\n",
    "    United States.\" (See <http://cdc.gov/BRFSS/>.)\n",
    "\n",
    "Other examples use data from the IRS, the U.S. Census, and the Boston\n",
    "Marathon."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8094c263",
   "metadata": {},
   "source": [
    "This second edition of *Think Stats* includes the chapters from the\n",
    "first edition, many of them substantially revised, and new chapters on\n",
    "regression, time series analysis, survival analysis, and analytic\n",
    "methods. The previous edition did not use pandas, SciPy, or StatsModels,\n",
    "so all of that material is new."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de8fd90c",
   "metadata": {},
   "source": [
    "## How I wrote this book\n",
    "\n",
    "When people write a new textbook, they usually start by reading a stack\n",
    "of old textbooks. As a result, most books contain the same material in\n",
    "pretty much the same order.\n",
    "\n",
    "I did not do that. In fact, I used almost no printed material while I\n",
    "was writing this book, for several reasons:\n",
    "\n",
    "-   My goal was to explore a new approach to this material, so I didn't\n",
    "    want much exposure to existing approaches.\n",
    "\n",
    "-   Since I am making this book available under a free license, I wanted\n",
    "    to make sure that no part of it was encumbered by copyright\n",
    "    restrictions.\n",
    "\n",
    "-   Many readers of my books don't have access to libraries of printed\n",
    "    material, so I tried to make references to resources that are freely\n",
    "    available on the Internet.\n",
    "\n",
    "-   Some proponents of old media think that the exclusive use of\n",
    "    electronic resources is lazy and unreliable. They might be right\n",
    "    about the first part, but I think they are wrong about the second,\n",
    "    so I wanted to test my theory."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f859a10",
   "metadata": {},
   "source": [
    "The resource I used more than any other is Wikipedia. In general, the\n",
    "articles I read on statistical topics were very good (although I made a\n",
    "few small changes along the way). I include references to Wikipedia\n",
    "pages throughout the book and I encourage you to follow those links; in\n",
    "many cases, the Wikipedia page picks up where my description leaves off.\n",
    "The vocabulary and notation in this book are generally consistent with\n",
    "Wikipedia, unless I had a good reason to deviate. Other resources I\n",
    "found useful were Wolfram MathWorld and the Reddit statistics forum,\n",
    "<http://www.reddit.com/r/statistics>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "713bf754",
   "metadata": {},
   "source": [
    "## Using the code\n",
    "\n",
    "The code and data used in this book are available from\n",
    "<https://github.com/AllenDowney/ThinkStats2>. Git is a version control\n",
    "system that allows you to keep track of the files that make up a\n",
    "project. A collection of files under Git's control is called a\n",
    "**repository**. GitHub is a hosting service that provides storage for\n",
    "Git repositories and a convenient web interface.\n",
    "\n",
    "The GitHub homepage for my repository provides several ways to work with\n",
    "the code:\n",
    "\n",
    "-   You can create a copy of my repository on GitHub by pressing the\n",
    "    Fork button. If you don't already have a GitHub account, you'll need\n",
    "    to create one. After forking, you'll have your own repository on\n",
    "    GitHub that you can use to keep track of code you write while\n",
    "    working on this book. Then you can clone the repo, which means that\n",
    "    you make a copy of the files on your computer.\n",
    "\n",
    "-   Or you could clone my repository. You don't need a GitHub account to\n",
    "    do this, but you won't be able to write your changes back to GitHub.\n",
    "\n",
    "-   If you don't want to use Git at all, you can download the files in a\n",
    "    Zip file using the button in the lower-right corner of the GitHub\n",
    "    page."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7279831",
   "metadata": {},
   "source": [
    "All of the code is written to work in both Python 2 and Python 3 with no\n",
    "translation.\n",
    "\n",
    "I developed this book using Anaconda from Continuum Analytics, which is\n",
    "a free Python distribution that includes all the packages you'll need to\n",
    "run the code (and lots more). I found Anaconda easy to install. By\n",
    "default it does a user-level installation, not system-level, so you\n",
    "don't need administrative privileges. And it supports both Python 2 and\n",
    "Python 3. You can download Anaconda from\n",
    "<http://continuum.io/downloads>.\n",
    "\n",
    "If you don't want to use Anaconda, you will need the following packages:\n",
    "\n",
    "-   pandas for representing and analyzing data,\n",
    "    <http://pandas.pydata.org/>;\n",
    "\n",
    "-   NumPy for basic numerical computation, <http://www.numpy.org/>;\n",
    "\n",
    "-   SciPy for scientific computation including statistics,\n",
    "    <http://www.scipy.org/>;\n",
    "\n",
    "-   StatsModels for regression and other statistical analysis,\n",
    "    <http://statsmodels.sourceforge.net/>; and\n",
    "\n",
    "-   matplotlib for visualization, <http://matplotlib.org/>.\n",
    "\n",
    "Although these are commonly used packages, they are not included with\n",
    "all Python installations, and they can be hard to install in some\n",
    "environments. If you have trouble installing them, I strongly recommend\n",
    "using Anaconda or one of the other Python distributions that include\n",
    "these packages.\n",
    "\n",
    "After you clone the repository or unzip the zip file, you should have a\n",
    "folder called `ThinkStats2/code` with a file called nsfg.py. If you run\n",
    "nsfg.py, it should read a data file, run some tests, and print a message\n",
    "like, \"All tests passed.\" If you get import errors, it probably means\n",
    "there are packages you need to install."
   ]
  },
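  {
   "cell_type": "markdown",
   "id": "9e3a7d14",
   "metadata": {},
   "source": [
    "Before running the book's code, one way to see which of the packages\n",
    "listed above are missing is a quick check like the following. This is\n",
    "only a sketch: the helper name `missing_packages` and the `REQUIRED`\n",
    "list are made up for this example and are not part of the book's\n",
    "repository. It uses `importlib.util.find_spec` from the standard\n",
    "library, which looks a package up without importing it.\n",
    "\n",
    "```python\n",
    "import importlib.util\n",
    "\n",
    "# Import names of the packages listed above (StatsModels imports as 'statsmodels').\n",
    "REQUIRED = ['pandas', 'numpy', 'scipy', 'statsmodels', 'matplotlib']\n",
    "\n",
    "\n",
    "def missing_packages(names):\n",
    "    # Return the names that cannot be found in the current environment.\n",
    "    return [name for name in names if importlib.util.find_spec(name) is None]\n",
    "\n",
    "\n",
    "missing = missing_packages(REQUIRED)\n",
    "if missing:\n",
    "    print('Missing packages:', ', '.join(missing))\n",
    "else:\n",
    "    print('All required packages were found.')\n",
    "```\n",
    "\n",
    "If the check reports missing packages, installing them (or switching to\n",
    "a distribution that bundles them) should resolve the import errors\n",
    "mentioned above."
   ]
  },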
  {
   "cell_type": "markdown",
   "id": "26b2091f",
   "metadata": {},
   "source": [
    "Most exercises use Python scripts, but some also use the IPython\n",
    "notebook. If you have not used IPython notebook before, I suggest you\n",
    "start with the documentation at\n",
    "<http://ipython.org/ipython-doc/stable/notebook/notebook.html>.\n",
    "\n",
    "I wrote this book assuming that the reader is familiar with core Python,\n",
    "including object-oriented features, but not pandas, NumPy, and SciPy. If\n",
    "you are already familiar with these modules, you can skip a few\n",
    "sections.\n",
    "\n",
    "I assume that the reader knows basic mathematics, including logarithms,\n",
    "for example, and summations. I refer to calculus concepts in a few\n",
    "places, but you don't have to do any calculus.\n",
    "\n",
    "If you have never studied statistics, I think this book is a good place\n",
    "to start. And if you have taken a traditional statistics class, I hope\n",
    "this book will help repair the damage."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69d62454",
   "metadata": {},
   "source": [
    "## Contributor List\n",
    "\n",
    "If you have a suggestion or correction, please send email to\n",
    "`downey@allendowney.com`. If I make a change based on your feedback, I\n",
    "will add you to the contributor list (unless you ask to be omitted).\n",
    "\n",
    "If you include at least part of the sentence the error appears in, that\n",
    "makes it easy for me to search. Page and section numbers are fine, too,\n",
    "but not quite as easy to work with. Thanks!\n",
    "\n",
    "-   Lisa Downey and June Downey read an early draft and made many\n",
    "    corrections and suggestions.\n",
    "\n",
    "-   Steven Zhang found several errors.\n",
    "\n",
    "-   Andy Pethan and Molly Farison helped debug some of the solutions,\n",
    "    and Molly spotted several typos.\n",
    "\n",
    "-   Dr. Nikolas Akerblom knows how big a Hyracotherium is.\n",
    "\n",
    "-   Alex Morrow clarified one of the code examples.\n",
    "\n",
    "-   Jonathan Street caught an error in the nick of time.\n",
    "\n",
    "-   Many thanks to Kevin Smith and Tim Arnold for their work on plasTeX,\n",
    "    which I used to convert this book to DocBook.\n",
    "\n",
    "-   George Caplan sent several suggestions for improving clarity.\n",
    "\n",
    "-   Julian Ceipek found an error and a number of typos.\n",
    "\n",
    "-   Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent\n",
    "    Johnson found errors in the first print edition.\n",
    "\n",
    "-   Jörg Beyer found typos in the book and made many corrections in the\n",
    "    docstrings of the accompanying code.\n",
    "\n",
    "-   Tommie Gannert sent a patch file with a number of corrections.\n",
    "\n",
    "-   Christoph Lendenmann submitted several errata.\n",
    "\n",
    "-   Michael Kearney sent me many excellent suggestions.\n",
    "\n",
    "-   Alex Birch made a number of helpful suggestions.\n",
    "\n",
    "-   Lindsey Vanderlyn, Griffin Tschurwald, and Ben Small read an early\n",
    "    version of this book and found many errors.\n",
    "\n",
    "-   John Roth, Carol Willing, and Carol Novitsky performed technical\n",
    "    reviews of the book. They found many errors and made many helpful\n",
    "    suggestions.\n",
    "\n",
    "-   David Palmer sent many helpful suggestions and corrections.\n",
    "\n",
    "-   Erik Kulyk found many typos.\n",
    "\n",
    "-   Nir Soffer sent several excellent pull requests for both the book\n",
    "    and the supporting code.\n",
    "\n",
    "-   GitHub user flothesof sent a number of corrections.\n",
    "\n",
    "-   Toshiaki Kurokawa, who is working on the Japanese translation of\n",
    "    this book, has sent many corrections and helpful suggestions.\n",
    "\n",
    "-   Benjamin White suggested more idiomatic Pandas code.\n",
    "\n",
    "-   Takashi Sato spotted a code error.\n",
    "\n",
    "Other people who found typos and similar errors are Andrew Heine, Gábor\n",
    "Lipták, Dan Kearney, Alexander Gryzlov, Martin Veillette, Haitao Ma,\n",
    "Jeff Pickhardt, Rohit Deshpande, Joanne Pratt, Lucian Ursu, Paul Glezen,\n",
    "Ting-kuang Lin, Scott Miller, Luigi Patruno.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
